Individual Project 2: White and Red Wine Data Sets

DATS 6103 - Individual Project 2 - Lucia Illari (G24308643)

DOI

Wine is increasinly enjoyed by a wider and wider audience, and Portugual in particular comes in at 9th for the top wine exporting countries (source). Focusing on vinho verde, exports of this wine suprassed 50% of total sales in 2019 and France, Canada, the United Kingdom, and Japan are the biggest buyers (source). What makes this wine so popular? What physicochemical qualities contribute to its sensory quality? In essence, what makes these wines so good?

Thus, the purpose of this project is to delve into understanding the physicochemical and sensory characteristics of red and white variants of the Portuguese "Vinho Verde" wine. The end goal will be to determine if these characteristics cannot be used to predict the quality of the wine. Unfortunately, due to privacy and logistic issues, there is no data about grape types, wine brand, wine selling price, etc.

The original dataset can be found over here. There are 1599 red wine instances and 4898 white wine instances, with 11 features and an output column called quality. This sensory preference column could theoretically range from 1 to 10, but in reality the values range from 3 to 9. The input variables are:

  • fixed acidity
  • volatile acidity
  • citric acid
  • residual sugar
  • chlorides
  • free sulfur dioxide
  • total sulfur dioxide
  • density
  • pH
  • sulphates
  • alcohol

As red and white wine have different flavor profiles, it makes sense to keep the two datasets separate, rather than merging them. What makes a good white wine might make a terrible red wine. Further, several of the attributes may be correlated, so it would make sense to explore feature selection down the line.

Section 0) Importing and cleaning the data

First thing's first, let us import the necessary packages as well as the data:

In [1]:
#importing necessary packages
import numpy as np
import pandas as pd
import seaborn as sns
import tkinter as tk
from matplotlib.figure import Figure 
from matplotlib.backends.backend_tkagg import (FigureCanvasTkAgg, NavigationToolbar2Tk)
import matplotlib.pyplot as plt
from tabulate import tabulate
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn_extra.cluster import KMedoids
import plotly.express as px
from kneed import KneeLocator
from sklearn.metrics import silhouette_score
from scipy.signal import find_peaks
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.metrics import plot_confusion_matrix
from sklearn.metrics import recall_score
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree

#set seed
random_seed = 42

%matplotlib notebook
#%matplotlib inline

#want to ignore warnings
import warnings
warnings.filterwarnings('ignore')
In [2]:
#importing in the datasets using pandas
#looking at the dataset we see the delimiter is actually a semicolon instead of a comma
red = pd.read_csv('winequality-red.csv', sep=';')
white = pd.read_csv('winequality-white.csv', sep=';')

white.head()
Out[2]:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
0 7.0 0.27 0.36 20.7 0.045 45.0 170.0 1.0010 3.00 0.45 8.8 6
1 6.3 0.30 0.34 1.6 0.049 14.0 132.0 0.9940 3.30 0.49 9.5 6
2 8.1 0.28 0.40 6.9 0.050 30.0 97.0 0.9951 3.26 0.44 10.1 6
3 7.2 0.23 0.32 8.5 0.058 47.0 186.0 0.9956 3.19 0.40 9.9 6
4 7.2 0.23 0.32 8.5 0.058 47.0 186.0 0.9956 3.19 0.40 9.9 6

Let's check for any missing data:

In [3]:
#if the dataframe has an empty string, we want to replace it with NaN
white.replace('', np.nan, inplace=True)
red.replace('', np.nan, inplace=True)

#we want to print out the total number of NaNs in the dataframe
print("There are",white.isnull().sum().sum(),"NaN elements in the white wine dataset.")
print("There are",red.isnull().sum().sum(),"NaN elements in the red wine dataset.")
There are 0 NaN elements in the white wine dataset.
There are 0 NaN elements in the red wine dataset.

Great, there is no missing data in either of the datasets.

Now the quality column is numeric, so I want to go ahead and create a categorical column that has binned the values into quality groups. There are a lot of average wines (5, 6 ratings) in this data set so we want to understand what makes the very best stand out, and make a cut-off of 7 and up be a "good" wine, and 6 and below an "okay" or "bad" wine, like so:

category values
bad 1-6
good 7-10

We could do a more fine grained rating, such as very poor, poor, average, good, excellent or even just poor, average, good, but we want to get at the heart of what separates the good ones from all the averages, especially since there are not very many good or poor wines in either dataset.

In [4]:
print("For the white wine dataset, for each sensory rating, here are the counts:\n",white['quality'].value_counts())
print("\nFor the red wine dataset, for each sensory rating, here are the counts:\n",red['quality'].value_counts())
For the white wine dataset, for each sensory rating, here are the counts:
 6    2198
5    1457
7     880
8     175
4     163
3      20
9       5
Name: quality, dtype: int64

For the red wine dataset, for each sensory rating, here are the counts:
 5    681
6    638
7    199
4     53
8     18
3     10
Name: quality, dtype: int64

For both datasets there are 7 and 8 ratings, though only for the white wine dataset are there 9 ratings. These does mean however that our classes are severely unbalanced.

Regardless, let's go ahead and make our good/bad columns.

In [5]:
#fill the new column with no data
white['good_bad'] = np.nan
red['good_bad'] = np.nan

#replace the data in the new column
for i in list(range(0,len(white))):
    if white['quality'][i] <= 6:
        white['good_bad'][i] = 'bad'
    if white['quality'][i] > 6:
        white['good_bad'][i] = 'good'
        
for i in list(range(0,len(red))):
    if red['quality'][i] <= 6:
        red['good_bad'][i] = 'bad'
    if red['quality'][i] > 6:
        red['good_bad'][i] = 'good'
        
#want to move the good_bad column to be the first column
cols = white.columns.tolist()
cols = cols[-1:] + cols[:-1]
white = white[cols]
red = red[cols]

#want to view the data
white.head()
Out[5]:
good_bad fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
0 bad 7.0 0.27 0.36 20.7 0.045 45.0 170.0 1.0010 3.00 0.45 8.8 6
1 bad 6.3 0.30 0.34 1.6 0.049 14.0 132.0 0.9940 3.30 0.49 9.5 6
2 bad 8.1 0.28 0.40 6.9 0.050 30.0 97.0 0.9951 3.26 0.44 10.1 6
3 bad 7.2 0.23 0.32 8.5 0.058 47.0 186.0 0.9956 3.19 0.40 9.9 6
4 bad 7.2 0.23 0.32 8.5 0.058 47.0 186.0 0.9956 3.19 0.40 9.9 6

We can no go ahead and deal with our data.

Section 1) EDA

i) Descriptive statistics

Let's go ahead and group our data, and then look at the mean and median for the two different groups.

In [6]:
#grouping the data
gr_yt = white.iloc[:, 0:(len(cols)-1)].groupby(['good_bad'])
gr_red = red.iloc[:, 0:(len(cols)-1)].groupby(['good_bad'])
In [7]:
#we want to know the counts for each group
print("There are "+str(gr_yt.size()[0])+" bad white wines and "+str(gr_yt.size()[1])+" good white wines, meaning "+str(round((gr_yt.size()[1])/(gr_yt.size()[0])*100,2))+"% of white wines are 'good'.")
print("There are "+str(gr_red.size()[0])+" bad white wines and "+str(gr_red.size()[1])+" good white wines, meaning "+str(round((gr_red.size()[1])/(gr_red.size()[0])*100,2))+"% of white wines are 'good'.")
There are 3838 bad white wines and 1060 good white wines, meaning 27.62% of white wines are 'good'.
There are 1382 bad white wines and 217 good white wines, meaning 15.7% of white wines are 'good'.
In [8]:
#print out the mean and median values
print("Mean values for white whine:")
print(tabulate(gr_yt.mean(), headers='keys', tablefmt='fancy_grid'))
print("\n")
print("Mean values for red whine:")
print(tabulate(gr_red.mean(), headers='keys', tablefmt='fancy_grid'))
print("\n")
print("Median values for white whine:")
print(tabulate(gr_yt.median(), headers='keys', tablefmt='fancy_grid'))
print("\n")
print("Median values for red whine:")
print(tabulate(gr_red.median(), headers='keys', tablefmt='fancy_grid'))
Mean values for white whine:
╒════════════╤═════════════════╤════════════════════╤═══════════════╤══════════════════╤═════════════╤═══════════════════════╤════════════════════════╤═══════════╤═════════╤═════════════╤═══════════╕
│ good_bad   │   fixed acidity │   volatile acidity │   citric acid │   residual sugar │   chlorides │   free sulfur dioxide │   total sulfur dioxide │   density │      pH │   sulphates │   alcohol │
╞════════════╪═════════════════╪════════════════════╪═══════════════╪══════════════════╪═════════════╪═══════════════════════╪════════════════════════╪═══════════╪═════════╪═════════════╪═══════════╡
│ bad        │         6.89059 │           0.281802 │      0.336438 │          6.70348 │   0.0478747 │               35.5173 │                141.983 │  0.994474 │ 3.18085 │    0.487004 │   10.2652 │
├────────────┼─────────────────┼────────────────────┼───────────────┼──────────────────┼─────────────┼───────────────────────┼────────────────────────┼───────────┼─────────┼─────────────┼───────────┤
│ good       │         6.72514 │           0.265349 │      0.326057 │          5.26151 │   0.0381604 │               34.5505 │                125.245 │  0.992412 │ 3.21513 │    0.500142 │   11.416  │
╘════════════╧═════════════════╧════════════════════╧═══════════════╧══════════════════╧═════════════╧═══════════════════════╧════════════════════════╧═══════════╧═════════╧═════════════╧═══════════╛


Mean values for red whine:
╒════════════╤═════════════════╤════════════════════╤═══════════════╤══════════════════╤═════════════╤═══════════════════════╤════════════════════════╤═══════════╤═════════╤═════════════╤═══════════╕
│ good_bad   │   fixed acidity │   volatile acidity │   citric acid │   residual sugar │   chlorides │   free sulfur dioxide │   total sulfur dioxide │   density │      pH │   sulphates │   alcohol │
╞════════════╪═════════════════╪════════════════════╪═══════════════╪══════════════════╪═════════════╪═══════════════════════╪════════════════════════╪═══════════╪═════════╪═════════════╪═══════════╡
│ bad        │         8.23683 │           0.547022 │      0.254407 │          2.51212 │   0.0892808 │               16.1722 │                48.2858 │  0.996859 │ 3.31462 │    0.644754 │    10.251 │
├────────────┼─────────────────┼────────────────────┼───────────────┼──────────────────┼─────────────┼───────────────────────┼────────────────────────┼───────────┼─────────┼─────────────┼───────────┤
│ good       │         8.847   │           0.40553  │      0.376498 │          2.70876 │   0.0759124 │               13.9816 │                34.8894 │  0.99603  │ 3.2888  │    0.743456 │    11.518 │
╘════════════╧═════════════════╧════════════════════╧═══════════════╧══════════════════╧═════════════╧═══════════════════════╧════════════════════════╧═══════════╧═════════╧═════════════╧═══════════╛


Median values for white whine:
╒════════════╤═════════════════╤════════════════════╤═══════════════╤══════════════════╤═════════════╤═══════════════════════╤════════════════════════╤═══════════╤══════╤═════════════╤═══════════╕
│ good_bad   │   fixed acidity │   volatile acidity │   citric acid │   residual sugar │   chlorides │   free sulfur dioxide │   total sulfur dioxide │   density │   pH │   sulphates │   alcohol │
╞════════════╪═════════════════╪════════════════════╪═══════════════╪══════════════════╪═════════════╪═══════════════════════╪════════════════════════╪═══════════╪══════╪═════════════╪═══════════╡
│ bad        │             6.8 │               0.27 │          0.32 │            6     │       0.045 │                    34 │                    140 │   0.99438 │ 3.17 │        0.47 │      10   │
├────────────┼─────────────────┼────────────────────┼───────────────┼──────────────────┼─────────────┼───────────────────────┼────────────────────────┼───────────┼──────┼─────────────┼───────────┤
│ good       │             6.7 │               0.25 │          0.31 │            3.875 │       0.037 │                    33 │                    122 │   0.99173 │ 3.2  │        0.48 │      11.5 │
╘════════════╧═════════════════╧════════════════════╧═══════════════╧══════════════════╧═════════════╧═══════════════════════╧════════════════════════╧═══════════╧══════╧═════════════╧═══════════╛


Median values for red whine:
╒════════════╤═════════════════╤════════════════════╤═══════════════╤══════════════════╤═════════════╤═══════════════════════╤════════════════════════╤═══════════╤══════╤═════════════╤═══════════╕
│ good_bad   │   fixed acidity │   volatile acidity │   citric acid │   residual sugar │   chlorides │   free sulfur dioxide │   total sulfur dioxide │   density │   pH │   sulphates │   alcohol │
╞════════════╪═════════════════╪════════════════════╪═══════════════╪══════════════════╪═════════════╪═══════════════════════╪════════════════════════╪═══════════╪══════╪═════════════╪═══════════╡
│ bad        │             7.8 │               0.54 │          0.24 │              2.2 │       0.08  │                    14 │                   39.5 │   0.9968  │ 3.31 │        0.6  │      10   │
├────────────┼─────────────────┼────────────────────┼───────────────┼──────────────────┼─────────────┼───────────────────────┼────────────────────────┼───────────┼──────┼─────────────┼───────────┤
│ good       │             8.7 │               0.37 │          0.4  │              2.3 │       0.073 │                    11 │                   27   │   0.99572 │ 3.27 │        0.74 │      11.6 │
╘════════════╧═════════════════╧════════════════════╧═══════════════╧══════════════════╧═════════════╧═══════════════════════╧════════════════════════╧═══════════╧══════╧═════════════╧═══════════╛

There isn't an interesting difference for every group - for example the density values for red and white wine are very similar, regardless if it's a good or bad wine, but for total sulfar dioxide, it appears the bad wines for white and red have higher values than the good wines, and from white to red there is almost a 100 point decrease in these values! Further, for both types of wine, the better wines have both higher mean and median values for alcohol - looks like the better the wine, the more boozy it is... This gives us an idea of what attributes might be less or more important later on, and that they might be different for the two wine categories. We might exclude free sulfur dioxide from our models for white wine, but include it for red wine, just looking at these values.

But looking at this information just in tables is a little boring, so let's make an interactive GUI to view the descriptive statistics.

In [9]:
def start_gui():
    #the main window
    root = tk.Tk()
    #setting size
    root.geometry("1000x825")
    #not allowing resizing
    root.resizable(0, 0)
    
    #allows me to group and organize widgets
    frame1 = tk.Frame(root)
    frame2 = tk.Frame(root)
    frame3 = tk.Frame(root)
    frame4 = tk.Frame(root)
    
    #window title
    root.title('Descriptive statistics for vinho verde datasets') 
    
    #generate some labels
    lbl1 = tk.Label(frame1, text = "Wine choice:")
    lbl1.pack()
    
    #functions nested like this because keeping track of buttons pressed 
    def wine_choice(opt):
        #functions determining for which columns to output descriptive statistics
        def describe(colm):
            if opt == 'white':
                res = white[colm].describe()
            else:
                res = red[colm].describe()
            #displaying statistics
            txt = "\nDescriptive statistics for {0} wine, {1}:\n\n{2}"
            lbl2 = tk.Label(frame3, text = txt.format(opt,colm,res))
            lbl2.pack()
            
            def b_plot(): 
                #figure that will contain the plot 
                fig = Figure(figsize = (5, 5), dpi = 75) 
                #setting the subplot grid parameters
                p1 = fig.add_subplot(211) 
                p2 = fig.add_subplot(212)
                
                if opt == 'white': 
                    grouping = white[['good_bad',colm]].groupby(['good_bad'])
                    dat_g = grouping.get_group('good')
                    dat_g = dat_g[colm]
                    dat_b = grouping.get_group('bad')
                    dat_b = dat_b[colm]
                    col = 'black'
                else:
                    grouping = red[['good_bad',colm]].groupby(['good_bad'])
                    dat_g = grouping.get_group('good')
                    dat_g = dat_g[colm]
                    dat_b = grouping.get_group('bad')
                    dat_b = dat_b[colm]
                    col = 'red'
                p1.hist(dat_g, weights=np.ones(len(dat_g)) / len(dat_g), color = col)
                p1.set_xlabel(colm)
                p1.set_ylabel('probability')
                title = 'Histogram of {0} for good {1} wine'
                p1.set_title(title.format(colm,opt))
                
                p2.hist(dat_b, weights=np.ones(len(dat_b)) / len(dat_b), color = col)
                p2.set_xlabel(colm)
                p2.set_ylabel('probability')
                title = 'Histogram of {0} for bad {1} wine'
                p2.set_title(title.format(colm,opt))
                fig.tight_layout()
            
                #creating the canvas containing figure and placing on the window 
                canvas = FigureCanvasTkAgg(fig, root)   
                canvas.draw() 
                canvas.get_tk_widget().pack(side="top")
                
                btn_r = tk.Button(frame4, text="Restart", width=10, height=2)
                btn_r.pack(side="bottom")
                btn_r.bind('<Button-1>', lambda e: click('YES'))
                
                def click(yn):
                    btn_r.config(command=restart())
            
            #button for plot
            btn_p = tk.Button(frame3, command = b_plot, width=10, height=3, text = "Histogram").pack()
        
        lbl3 = tk.Label(frame2, text = "Pick an attribute to investigate:")
        lbl3.pack()
    
        #spawn attribute buttons after user chooses a wine
        #generate buttons
        btn3 = tk.Button(frame2, text='fixed acidity', width=10, height=3)
        btn3.pack(side="left")
        #this keeps track of which button is pressed
        btn3.bind('<Button-1>', lambda e: describe('fixed acidity'))
    
        btn4 = tk.Button(frame2, text='volatile\nacidity', width=10, height=3)
        btn4.pack(side='left')
        btn4.bind('<Button-1>', lambda e: describe('volatile acidity'))

        btn5 = tk.Button(frame2, text='citric\nacid', width=10, height=3)
        btn5.pack(side='left')
        btn5.bind('<Button-1>', lambda e: describe('citric acid'))

        btn6 = tk.Button(frame2, text='residual\nsugar', width=10, height=3)
        btn6.pack(side='left')
        btn6.bind('<Button-1>', lambda e: describe('residual sugar'))

        btn7 = tk.Button(frame2, text='chlorides', width=10, height=3)
        btn7.pack(side='left')
        btn7.bind('<Button-1>', lambda e: describe('chlorides'))

        btn8 = tk.Button(frame2, text='free\nsulfur\ndioxide', width=10, height=3)
        btn8.pack(side='left')
        btn8.bind('<Button-1>', lambda e: describe('free sulfur dioxide'))

        btn9 = tk.Button(frame2, text='total\nsulfur\ndioxide', width=10, height=3)
        btn9.pack(side='left')
        btn9.bind('<Button-1>', lambda e: describe('total sulfur dioxide'))

        btn10 = tk.Button(frame2, text='density', width=10, height=3)
        btn10.pack(side='left')
        btn10.bind('<Button-1>', lambda e: describe('density'))

        btn11 = tk.Button(frame2, text='pH', width=10, height=3)
        btn11.pack(side='left')
        btn11.bind('<Button-1>', lambda e: describe('pH'))

        btn12 = tk.Button(frame2, text='sulphates', width=10, height=3)
        btn12.pack(side='left')
        btn12.bind('<Button-1>', lambda e: describe('sulphates'))

        btn13 = tk.Button(frame2, text='alcohol', width=10, height=3)
        btn13.pack(side='left')
        btn13.bind('<Button-1>', lambda e: describe('alcohol'))

        btn14 = tk.Button(frame2, text='quality', width=10, height=3)
        btn14.pack(side='left')
        btn14.bind('<Button-1>', lambda e: describe('quality'))

    #buttons for wine choices
    btn1 = tk.Button(frame1, text = "white", width=10, height=2)
    btn1.pack(side='left')
    #remember which button user picks
    btn1.bind('<Button-1>', lambda e: wine_choice('white'))

    btn2 = tk.Button(frame1, text = "red", width=10, height=2)
    btn2.pack(side='left')
    btn2.bind('<Button-1>', lambda e: wine_choice('red'))
    
    
    def restart():
        root.destroy()
        start_gui()

    frame1.pack(padx=1,pady=1)
    frame2.pack(padx=1,pady=1)
    frame3.pack(padx=1,pady=1)
    frame4.pack(padx=1,pady=1)

    #must be called for window to be drawn and events to be processed
    root.mainloop()
In [54]:
start_gui()

ii) Boxplots

This is another way to visualize the information in the mean and median tables, essentially.

In [11]:
for i in cols[1:(len(cols)-1)]:
    fig, ax = plt.subplots(1,2, sharey = True, figsize=(15,5))
    
    p1 = sns.boxplot(x="good_bad", y=i,data=white, palette="cool", ax=ax[0], showmeans=True);
    p1.set_title("Box plot for "+i+" for white wine")
    p1.set(xlabel='quality of wine', ylabel=i)
    
    p2 = sns.boxplot(x="good_bad", y=i, data=red, palette="OrRd_r", ax=ax[1], showmeans=True);
    p2.set_title("Box plot for "+i+" for red wine")
    p2.set(xlabel='quality of wine', ylabel=i)

iii) Correlations

Now to check to see what attributes are highly correlated, and I am particularly interested to see if there are any highly correlated attributes with the quality of the wine.

In [53]:
corr_yt = white.corr(method='pearson')
corr_red = red.corr(method='pearson')
#print(tabulate(corr_yt, headers='keys', tablefmt='fancy_grid'))

fig, ax = plt.subplots(1,2, sharey = False, figsize=(20,8))

hm1 = sns.heatmap(corr_yt,center=0, cmap=sns.diverging_palette(20, 220, n=200), square=False, annot=True, ax=ax[0], linewidths=.5)
hm1.set_xticklabels(hm1.get_xticklabels(),rotation=45,horizontalalignment='right');
#strange cut-oof of top and bottom rows that requires manually setting the y-axis
hm1.set_ylim(len(cols)-1, -0.5);
hm1.set_title("Correlation of features for white wine");

hm2 = sns.heatmap(corr_red,center=0,cmap=sns.diverging_palette(20, 220, n=200),square=False, ax=ax[1], annot=True, linewidths=.5)
hm2.set_xticklabels(hm2.get_xticklabels(),rotation=45,horizontalalignment='right');
hm2.set_ylim(len(cols)-0.5, -0.5);
hm2.set_title("Correlation of features for red wine");

Looks like alcohol is moderately positively correlated with quality and very strongly negatively correlated with denisty, while density is weakly negatively correlated with quality and very strongly positively correlated with residual sugars.

iv) Scatterplots

Let's just quickly look at some scatterplots.

In [13]:
sns.pairplot(white.iloc[:, 0:12], diag_kind="kde", kind="hist", hue="good_bad");

Often times the good and bad quality wines are on top of each other, or the bad quality wine has a tail of wines that break off from the pack. Using unsupervised machine learning to find clusters on this data might be difficult, and might prove more fruitful if we first reduce the dimensionality via PCA.

Section 2) PCA

I'm interested in reducing the dimensionality of the dataset. By reducing the number of features, we will speed up our algorithms and improve their performance. Further, decreasing the number of attributes should reduce the noise. First, we need to standardize our data

In [14]:
features = cols[1:13]

#separating out the features
#in this case I will keep the `quality` feature because this is unsupervized
x_yt = white.loc[:, features].values
x_red = red.loc[:, features].values

#separating out the target
y_yt = white.loc[:,['good_bad']].values
y_red = red.loc[:,['good_bad']].values

#standardizing the features
scal = StandardScaler()
x_yt_t = scal.fit_transform(x_yt)
x_red_t = scal.fit_transform(x_red)

Next we need to determine how many components we are keeping.

In [15]:
#transforming the features
pca = PCA()
pca_yt = pca.fit(x_yt_t)
x_yt_t = pca.transform(x_yt_t)
pca_red = pca.fit(x_red_t)
x_red_t = pca.transform(x_red_t)

#plotting the explained variance by components
fig, ax = plt.subplots(1,2, sharey = False, figsize=(20,8))
l1 = sns.lineplot(data = pca.fit(x_yt_t).explained_variance_ratio_.cumsum(), ax=ax[0], marker='o', color='black', linewidth=2.5);
l1.set_title("Exlained variance by components for white wine");
l1.set(xlabel='Number of components', ylabel='Cumulative explained variance');

l2 = sns.lineplot(data = pca.fit(x_red_t).explained_variance_ratio_.cumsum(), ax=ax[1], marker='o', color='red', linewidth=2.5);
l2.set_title("Exlained variance by components for red wine");
l2.set(xlabel='Number of components', ylabel='Cumulative explained variance');

We want to preserve at least 80% of the variance so for both wines we will keep 6 components. Now we can perform PCA.

In [16]:
#keeping 6 components
pc_yt = PCA(n_components=6).fit_transform(x_yt_t)
pc_red = PCA(n_components=6).fit_transform(x_red_t)

pdf_yt = pd.DataFrame(data = pc_yt, columns = ['pc1', 'pc2','pc3', 'pc4','pc5','pc6'])
pdf_red = pd.DataFrame(data = pc_red, columns = ['pc1', 'pc2','pc3', 'pc4','pc5','pc6'])

fdf_yt = pd.concat([pdf_yt, white[['good_bad']]], axis = 1)
fdf_red = pd.concat([pdf_red, red[['good_bad']]], axis = 1)

Great, let's go ahead and investigate the effects of each feature on each component.

In [51]:
x_yt_s = scal.fit_transform(x_yt)
pca_6_yt = PCA(n_components=6) 
pca_6_yt.fit(x_yt_s) 
x_yt_pca=pca_6_yt.transform(x_yt_s)

x_r_s = scal.fit_transform(x_red)
pca_6_r = PCA(n_components=6) 
pca_6_r.fit(x_r_s) 
x_r_pca=pca_6_r.transform(x_r_s)

#plot of the effects of features on each component
fig, ax = plt.subplots(1,2, sharey = False, figsize=(20,8))

hm3 = sns.heatmap(pca_6_yt.components_,
                  xticklabels=cols[1:12],
                  yticklabels=["PC "+str(x) for x in range(1,7)], 
                  center=0, 
                  cmap=sns.diverging_palette(20, 220), 
                  square=False, ax=ax[0], linewidths=.5);
hm3.set_xticklabels(hm3.get_xticklabels(),rotation=45,horizontalalignment='right');
hm3.set_ylim(6, -0.5);
hm3.set_title("Effect of features on each component for white wine ");
hm3.set(xlabel='Features', ylabel='Components');

hm4 = sns.heatmap(pca_6_r.components_,
                  xticklabels=cols[1:12],
                  yticklabels=["PC "+str(x) for x in range(1,7)], 
                  center=0, cmap=sns.diverging_palette(20, 220), 
                  square=False, ax=ax[1], linewidths=.5);
hm4.set_xticklabels(hm4.get_xticklabels(),rotation=45,horizontalalignment='right');
hm4.set_ylim(6, -0.5);
hm4.set_title("Effect of features on each component for red wine ");
hm4.set(xlabel='Features', ylabel='Components');

I'd like to visualize some of the components in 3D before moving on to KMeans and KMedoids.

In [18]:
fig = px.scatter_3d(fdf_yt, x='pc1', y='pc2', z='pc3',color='good_bad')
fig.show()
In [19]:
fig = px.scatter_3d(fdf_yt, x='pc4', y='pc5', z='pc6',color='good_bad')
fig.show()

Now we can move forward with clustering.

Section 2) Clustering

i) KMeans

The first question we need to tackle is "what is the appropriate number of clusters". The Elbow Method is very popular, so we will employ that first.

In [20]:
wcss_yt=[]
wcss_red=[]
for i in range(1,11):
    km_yt = KMeans(n_clusters=i)
    km_red = KMeans(n_clusters=i)
    km_yt.fit(pc_yt)
    km_red.fit(pc_red)
    wcss_yt.append(km_yt.inertia_)
    wcss_red.append(km_red.inertia_)
In [21]:
fig, ax = plt.subplots(1,2, sharey = False, figsize=(20,8))
l3 = sns.lineplot(data = wcss_yt, ax=ax[0], marker='o', color='black', linewidth=2.5);
l3.set_title("KMeans with PCA clustering for white wine");
l3.set(xlabel='Number of clusters', ylabel='WCSS');

l4 = sns.lineplot(data = wcss_red, ax=ax[1], marker='o', color='red', linewidth=2.5);
l4.set_title("KMeans with PCA clustering for red wine");
l4.set(xlabel='Number of clusters', ylabel='WCSS');
In [22]:
kl_yt = KneeLocator(range(1, 11), wcss_yt, curve="convex", direction="decreasing")
kl_red = KneeLocator(range(1, 11), wcss_red, curve="convex", direction="decreasing")

print("Via Elbow Method optimal clusters for white wine is "+str(kl_yt.elbow))
print("Via Elbow Method optimal clusters for red wine is "+str(kl_red.elbow))
Via Elbow Method optimal clusters for white wine is 3
Via Elbow Method optimal clusters for red wine is 5
In [23]:
sc_yt=[]
sc_red=[]
for i in range(2,11):
    km_yt = KMeans(n_clusters=i)
    km_red = KMeans(n_clusters=i)
    km_yt.fit(pc_yt)
    km_red.fit(pc_red)
    score_yt = silhouette_score(pc_yt, km_yt.labels_)
    score_red = silhouette_score(pc_red, km_red.labels_)
    sc_yt.append(score_yt)
    sc_red.append(score_red)
In [24]:
fig, ax = plt.subplots(1,2, sharey = False, figsize=(20,8))
l5 = sns.lineplot(data = sc_yt, ax=ax[0], marker='o', color='black', linewidth=2.5);
l5.set_title("KMeans with PCA clustering for white wine");
l5.set(xlabel='Number of clusters', ylabel='Silhouette Score');

l6 = sns.lineplot(data = sc_red, ax=ax[1], marker='o', color='red', linewidth=2.5);
l6.set_title("KMeans with PCA clustering for red wine");
l6.set(xlabel='Number of clusters', ylabel='Silhouette Score');
In [25]:
p_yt, prop_yt = find_peaks(sc_yt, height=0)
print("Via Silhouette Scores optimal clusters for white wine is "+str(p_yt[0]))

p_red, prop_red = find_peaks(sc_red, height=0)
print("Via Silhouette Scores optimal clusters for red wine is "+str(p_red[0])+" and "+str(p_red[1]))
Via Silhouette Scores optimal clusters for white wine is 4
Via Silhouette Scores optimal clusters for red wine is 2 and 5

Now we need to implement KMeans. I will try 3 and 4 clusters for the white wine data, and 2 and 5 clusters for red wine.

In [26]:
x_y = white.loc[:, features].values
x_r = red.loc[:, features].values
x_y_t = scal.fit_transform(x_y)
x_r_t = scal.fit_transform(x_r)
pca.fit(x_y_t)
x_y_t = pca.transform(x_y_t)
pca.fit(x_r_t)
x_r_t = pca.transform(x_r_t)

#employing the kmeans algorithm
k_y_3 = KMeans(n_clusters=3)
k_y_3.fit(x_y_t)
#adding cluster grouping to instance
fdf_yt_c3 = pd.concat([fdf_yt, pd.DataFrame(data=k_y_3.labels_,columns=['cluster']) ], axis = 1)

k_y_4 = KMeans(n_clusters=4)
k_y_4.fit(x_y_t)
fdf_yt_c4 = pd.concat([fdf_yt, pd.DataFrame(data=k_y_4.labels_,columns=['cluster']) ], axis = 1)

k_r_2 = KMeans(n_clusters=2)
k_r_2.fit(x_r_t)
fdf_red_c2 = pd.concat([fdf_red, pd.DataFrame(data=k_r_2.labels_,columns=['cluster']) ], axis = 1)

k_r_5 = KMeans(n_clusters=5)
k_r_5.fit(x_r_t)
fdf_red_c5 = pd.concat([fdf_red, pd.DataFrame(data=k_r_5.labels_,columns=['cluster']) ], axis = 1)

Now that we've done that, I want to visualize the clusters in a 3D plot, along with the means/centers in feature space.

In [27]:
print("KMeans with k=3 for white wine")
fig = px.scatter_3d(fdf_yt_c3, x='pc1', y='pc2', z='pc3',color='cluster')
fig.show()

m_3_y = pd.DataFrame(scal.inverse_transform(pca.inverse_transform(k_y_3.cluster_centers_)),columns=cols[1:13])
m_3_y['cluster'] = m_3_y.index
m_3_y
KMeans with k=3 for white wine
Out[27]:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality cluster
0 7.854211 0.486615 0.231009 2.065526 0.073205 16.733835 43.485643 0.995495 3.334832 0.712491 11.089899 6.195138 0
1 10.066685 0.446129 0.460188 3.044719 0.106155 15.214959 48.263761 0.998412 3.183140 0.711599 10.130414 5.690241 1
2 6.905928 0.645478 0.109316 2.406555 0.079944 15.849707 47.064291 0.996060 3.423162 0.558748 10.177532 5.122570 2
In [28]:
print("KMeans with k=4 for white wine")
fig = px.scatter_3d(fdf_yt_c4, x='pc1', y='pc2', z='pc3',color='cluster')
fig.show()

m_4_y = pd.DataFrame(scal.inverse_transform(pca.inverse_transform(k_y_4.cluster_centers_)),columns=cols[1:13])
m_4_y['cluster'] = m_4_y.index
m_4_y
KMeans with k=4 for white wine
Out[28]:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality cluster
0 6.675442 0.161838 0.663464 2.349778 0.251767 11.864497 75.157042 0.992680 3.006187 0.739328 10.738386 5.273920 0
1 6.914711 0.645593 0.109368 2.388155 0.080356 15.794058 46.861178 0.996065 3.422832 0.561205 10.173443 5.114746 1
2 7.887478 0.494612 0.222323 2.108900 0.068815 16.894651 43.392770 0.995562 3.338927 0.709029 11.102343 6.232487 2
3 10.188329 0.458348 0.448877 3.048316 0.099415 15.386479 46.756707 0.998640 3.194900 0.710106 10.109262 5.705558 3
In [29]:
print("KMeans with k=2 for red wine")
fig = px.scatter_3d(fdf_red_c2, x='pc1', y='pc2', z='pc3',color='cluster')
fig.show()

m_2_r = pd.DataFrame(scal.inverse_transform(pca.inverse_transform(k_r_2.cluster_centers_)),columns=cols[1:13])
m_2_r['cluster'] = m_2_r.index
m_2_r
KMeans with k=2 for red wine
Out[29]:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality cluster
0 9.855026 0.397129 0.463339 2.713288 0.098242 13.984668 38.192504 0.997475 3.207683 0.754344 10.733674 6.013629 0
1 7.429051 0.603626 0.159397 2.437599 0.081216 16.971344 51.267787 0.996324 3.371107 0.602352 10.242770 5.416996 1
In [30]:
print("KMeans with k=5 for red wine")
fig = px.scatter_3d(fdf_red_c5, x='pc1', y='pc2', z='pc3',color='cluster')
fig.show()

m_5_r = pd.DataFrame(scal.inverse_transform(pca.inverse_transform(k_r_5.cluster_centers_)),columns=cols[1:13])
m_5_r['cluster'] = m_5_r.index
m_5_r
KMeans with k=5 for red wine
Out[30]:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality cluster
0 7.477857 0.648866 0.118000 2.210268 0.083821 11.740179 34.446429 0.996567 3.373857 0.586750 9.965506 5.258929 0
1 7.278182 0.441152 0.249970 2.207727 0.069606 17.768182 39.681818 0.994596 3.392545 0.692515 11.716263 6.309091 1
2 8.433333 0.529667 0.486000 1.963333 0.347333 14.966667 62.933333 0.997087 3.052667 1.272333 9.476667 5.333333 2
3 8.205263 0.540015 0.291176 3.242724 0.088573 27.417957 91.814241 0.997512 3.286099 0.627926 9.771930 5.272446 3
4 10.703371 0.406531 0.494635 2.772331 0.086854 10.227528 29.137640 0.998301 3.181404 0.714270 10.614232 5.960674 4

ii) KMedoid

I will do 2 clusters for white and red wine respectively, however now it will be done using KMedoids. This is because before it did not feel that the additional clusters were very meaningful for this dataset. For KMeans, the center didn't have to be an actual point in the data, but for KMedoid it will be.

In [31]:
km_y_2 = KMedoids(n_clusters=2)
km_y_2.fit(x_y_t)
fdf_y_c2 = pd.concat([fdf_yt, pd.DataFrame(data=km_y_2.labels_,columns=['cluster']) ], axis = 1)

km_r_2 = KMedoids(n_clusters=2)
km_r_2.fit(x_r_t)
fdf_r_c2 = pd.concat([fdf_red, pd.DataFrame(data=km_r_2.labels_,columns=['cluster']) ], axis = 1)
In [32]:
print("KMedoids with k=2 for white wine")
fig = px.scatter_3d(fdf_y_c2, x='pc1', y='pc2', z='pc3',color='cluster')
fig.show()

me_2_y = pd.DataFrame(scal.inverse_transform(pca.inverse_transform(km_y_2.cluster_centers_)),columns=cols[1:13])
me_2_y['cluster'] = me_2_y.index
me_2_y
KMedoids with k=2 for white wine
Out[32]:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality cluster
0 10.199284 0.540175 0.355194 2.644709 0.106468 15.762534 30.916112 0.998140 3.154501 0.650658 10.043683 5.463829 0
1 7.563560 0.586455 0.083386 2.658921 0.071030 17.965811 51.479267 0.995647 3.396804 0.681418 10.794333 5.570725 1
In [33]:
print("KMedoids with k=2 for red wine")
fig = px.scatter_3d(fdf_r_c2, x='pc1', y='pc2', z='pc3',color='cluster')
fig.show()

me_2_r = pd.DataFrame(scal.inverse_transform(pca.inverse_transform(km_r_2.cluster_centers_)),columns=cols[1:13])
me_2_r['cluster'] = me_2_r.index
me_2_r
KMedoids with k=2 for red wine
Out[33]:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality cluster
0 9.6 0.42 0.35 2.1 0.083 17.0 38.0 0.99622 3.23 0.66 11.1 6.0 0
1 7.8 0.59 0.18 2.3 0.076 17.0 54.0 0.99750 3.43 0.59 10.0 5.0 1

When we look, there are definitely some features that change a bit, such as for red wine, there is a point difference between alcohol values between cluster 0 and 1, a nearly 1 point difference between fixed acidity values, and a 16 point difference beween total sulfur density values. Other categories, such as free sulfur dioxide, does not change between clusters. So these changes make a difference bewteen an average wine and a slightly above average wine This, of course, doesn't quite help us on our quest to determine what features and values make a great wine, but they are groupings that would have been difficult for a human to select out.

Now we can move forward with clustering.

Section 3) Supervized Machine Laerning

i) Logistic regression

The classic and most basic is logistic regression when the response variable is categorical in nature, and we employ it here to act as a base comparison for other supervized ml algorithms.

In [34]:
#need to remove `quality` from this data
features = cols[1:12]
x_yt = white.loc[:, features].values

#we need to split the data into train-val-test datasets
#I go for a 70-30 split here
#train to fit the model and test to evaluate it
x_white_train, x_white_test, y_white_train, y_white_test = train_test_split(x_yt, y_yt, train_size=0.7, random_state=random_seed)

#need to scale the data
x_white_train = scal.fit_transform(x_white_train)

#creating and training the model
model_lr_y = LogisticRegression(solver='saga',C=0.01,multi_class='auto',class_weight='balanced',random_state=random_seed).fit(x_white_train,y_white_train);

#evaluating the model
x_white_test = scal.transform(x_white_test)
y_white_pred = model_lr_y.predict(x_white_test)

print("For Logit Reg for white wine the training accuracy is "
      +str(round(model_lr_y.score(x_white_train, y_white_train)*100,2))
      +"% and the test accuracy is "
      +str(round(model_lr_y.score(x_white_test, y_white_test)*100,2))+"%.")
For Logit Reg for white wine the training accuracy is 71.5% and the test accuracy is 70.82%.
In [35]:
#same with red wine data
x_red = red.loc[:, features].values
x_red_train, x_red_test, y_red_train, y_red_test = train_test_split(x_red, y_red, train_size=0.7, random_state=random_seed)
x_red_train = scal.fit_transform(x_red_train)
model_lr_r = LogisticRegression(solver='saga',multi_class='auto',class_weight='balanced',random_state=random_seed).fit(x_red_train,y_red_train);
x_red_test = scal.transform(x_red_test)
y_red_pred = model_lr_r.predict(x_red_test)

print("For Logit Reg for red wine the training accuracy is "
      +str(round(model_lr_r.score(x_red_train, y_red_train)*100,2))
      +"% and the test accuracy is "
      +str(round(model_lr_r.score(x_red_test, y_red_test)*100,2))+"%.")
For Logit Reg for red wine the training accuracy is 78.46% and the test accuracy is 80.21%.

I wanted to check both the test and training accuracy to make sure the training accuracy wasn't much higher than the test accuracy, which would indicate that overfitting was occurring. That, however, is clearly not the case here. Let's check the confusion matrix for a closer look at the false/true positives and false/true negatives.

In [36]:
fig, (ax1, ax2) = plt.subplots(1, 2, sharey=False, figsize=(15,5))
plot_confusion_matrix(model_lr_y, x_white_test, y_white_test, ax=ax1);
ax1.set_title("Confusion matrix for logit reg, white wine");
plot_confusion_matrix(model_lr_r, x_red_test, y_red_test, ax=ax2);
ax2.set_title("Confusion matrix for logit reg, red wine");
In [37]:
print("For white wine:")
print(classification_report(y_white_test, y_white_pred))
print("\n\nFor red wine:")
print(classification_report(y_red_test, y_red_pred))
For white wine:
              precision    recall  f1-score   support

         bad       0.91      0.70      0.79      1141
        good       0.42      0.75      0.53       329

    accuracy                           0.71      1470
   macro avg       0.66      0.72      0.66      1470
weighted avg       0.80      0.71      0.73      1470



For red wine:
              precision    recall  f1-score   support

         bad       0.97      0.79      0.87       413
        good       0.40      0.87      0.55        67

    accuracy                           0.80       480
   macro avg       0.69      0.83      0.71       480
weighted avg       0.89      0.80      0.83       480

These are not bad values, all told. The recall score are all above 0.5, and for red wine it got as high as 0.87, and the closer to 1, the better. However, the precision for good wines is dreadfully low, which means we are mostly classifying everything as a bad wine, and not really picking out what makes a wine good using this machine learning algorithm. Let's take a look to see how the features contribute to the model.

In [38]:
print("Logit Reg coeeficients for white wine model:")
pd.DataFrame(data={'features': features, 'coefficients': model_lr_y.coef_[0]})
Logit Reg coeeficients for white wine model:
Out[38]:
features coefficients
0 fixed acidity 0.087956
1 volatile acidity -0.374543
2 citric acid -0.084929
3 residual sugar 0.360175
4 chlorides -0.252661
5 free sulfur dioxide 0.158252
6 total sulfur dioxide -0.132859
7 density -0.337917
8 pH 0.156734
9 sulphates 0.136284
10 alcohol 0.725050
In [39]:
print("Logit Reg coeeficients for red wine model:")
pd.DataFrame(data={'features': features, 'coefficients': model_lr_r.coef_[0]})
Logit Reg coeeficients for red wine model:
Out[39]:
features coefficients
0 fixed acidity 0.583090
1 volatile acidity -0.564933
2 citric acid -0.139361
3 residual sugar 0.346403
4 chlorides -0.299354
5 free sulfur dioxide 0.075288
6 total sulfur dioxide -0.465169
7 density -0.470477
8 pH 0.034069
9 sulphates 0.642206
10 alcohol 0.950053

For both, alcohol positively contributes to the quality of the wine, as suspected. For red wine, fixed acidity also positively contributes whereas it doesn't at all for white wine. For both, volatile acidity negatively contributes but more so for red wine, and the same for denisty. For red wine, while sulphates contribute quite a bit to classifying a wine as good, it only weakly does so for white wine. Thus, as we also suspected, what makes for a good red wine does not necessarily make a good white wine...

However, the precision for good wines with this logit reg models is not so great, so now we turn to Decision Trees to see if we can't do better.

ii) DecisionTreeClassifier

In [40]:
features = cols[1:12]
x_yt = white.loc[:, features].values
x_white_train, x_white_test, y_white_train, y_white_test = train_test_split(x_yt, y_yt, train_size=0.7, random_state=random_seed)

decision_tree_yt = DecisionTreeClassifier(criterion='gini',splitter='random',random_state=random_seed,min_samples_leaf=10);
decision_tree_yt.fit(x_white_train, y_white_train);
In [41]:
acc_decision_tree_yt = decision_tree_yt.score(x_white_test, y_white_test)
print("Decision tree accuracy for white wine is "+str(round(acc_decision_tree_yt*100,3)))
Decision tree accuracy for white wine is 80.136
In [42]:
x_red = red.loc[:, features].values
x_red_train, x_red_test, y_red_train, y_red_test = train_test_split(x_red, y_red, train_size=0.7, random_state=random_seed)

decision_tree_red = DecisionTreeClassifier(criterion='gini',splitter='random',random_state=random_seed);
decision_tree_red.fit(x_red_train, y_red_train);
In [43]:
acc_decision_tree_red = decision_tree_red.score(x_red_test, y_red_test)
print("Decision tree accuracy for red wine is "+str(round(acc_decision_tree_red*100,3)))
Decision tree accuracy for red wine is 85.625

We have a problem, however. This tree is incredibly deep and very difficult to interpret. Often times when the gini criterion equals 0, theres 1 or 4 samples in the node. How is that meaningful? Therefor, for something easier to interpret visually, we must limit the depth of the tree.

In [44]:
depth=3
m_split = 5
feat = 'sqrt'
decision_tree_y = DecisionTreeClassifier(criterion='gini',splitter='random',max_depth=depth,min_samples_leaf=m_split,max_features=feat,random_state=random_seed);
decision_tree_y.fit(x_white_train, y_white_train);

acc_decision_tree_y = decision_tree_y.score(x_white_test, y_white_test)
print("Decision tree with a max depth of "+str(depth)+" and minimum number of samples of "+str(m_split)+" accuracy for white wine is "+str(round(acc_decision_tree_y*100,3))+"%.")
Decision tree with a max depth of 3 and minimum number of samples of 5 accuracy for white wine is 77.755%.
In [45]:
fig = plt.figure(figsize=(25,20))
_ = tree.plot_tree(decision_tree_y, 
                   feature_names=features,  
                   class_names=['good','bad'],
                   filled=True, label='all',impurity=True)

fig.savefig("dt_white_depth3_samp5.png")
In [46]:
decision_tree_r = DecisionTreeClassifier(criterion='gini',splitter='random',max_depth=depth,min_samples_leaf=m_split,max_features=feat,random_state=3);
decision_tree_r.fit(x_red_train, y_red_train);

acc_decision_tree_r = decision_tree_r.score(x_red_test, y_red_test)
print("Decision tree with a max depth of "+str(depth)+" and minimum number of samples of "+str(m_split)+" accuracy for red wine is "+str(round(acc_decision_tree_r*100,3))+"%.")
Decision tree with a max depth of 3 and minimum number of samples of 5 accuracy for red wine is 84.792%.
In [47]:
fig = plt.figure(figsize=(25,20))
_ = tree.plot_tree(decision_tree_r, 
                   feature_names=features,  
                   class_names=['good','bad'],
                   filled=True, label='all',impurity=True)

fig.savefig("dt_red_depth3_samp5.png")

If we ignore trying to make human-readable and -interpretable decision trees, technically the accuracy is better than logit reg.

In [ ]: